Statistical Parsing by Machine Learning from a Classical Arabic Treebank

Author

Kais Dukes

Abstract

Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic. Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state-of-the-art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations. A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic. The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision.
A practical result of the annotation effort is the corpus website: http://corpus.quran.com, an educational resource with over two million users per year.

بِسْمِ ٱللَّهِ ٱلرَّحْمَٰنِ ٱلرَّحِيمِ

سُبْحَانَكَ لَا عِلْمَ لَنَا إِلَّا مَا عَلَّمْتَنَا إِنَّكَ أَنتَ ٱلْعَلِيمُ ٱلْحَكِيمُ

"Glory be to thee! We have no knowledge except what you have taught us. Indeed it is you who is the all-knowing, the all-wise." A prayer of the angels, the Quran, verse (2:32)

Similar resources

Automatic error correction in a syntactic treebank using transformation-based machine learning

A treebank is one of the most useful resources for supervised or semi-supervised learning in many NLP tasks such as speech recognition, spoken language systems, parsing and machine translation. Treebanks can be developed in different ways, which can generally be categorized into manual and statistical approaches. While the resulting treebank in each of these methods has annotation errors,...

Supervised learning model for parsing Arabic language

Parsing the Arabic language is a difficult task given the specificities of this language and the scarcity of digital resources (grammars and annotated corpora). In this paper, we suggest a method for Arabic parsing based on supervised machine learning. We used the SVM algorithm to select the syntactic labels of the sentence. Furthermore, we evaluated our parser following the cross valida...

ARSYPAR: A tool for parsing the Arabic language based on supervised learning

In this paper, we present a tool for parsing the Arabic language based on supervised machine learning. The algorithm used for the learning phase is the support vector machine. We also used the Penn Arabic Treebank as a learning corpus. Furthermore, we evaluated our parser following the cross validation method. The obtained results are very encouraging. We give at the end our vision to ameliorat...

Statistical Dependency Parsing of Four Treebanks

Multilingual dependency parsing has gained popularity in recent years for several reasons. Dependency structures are more adequate for languages with freer word order than the traditional constituency notion. There is a growing availability of dependency treebanks for new languages. Broad coverage statistical dependency parsers are available and easily portable to new languages. Dependency pars...

Why is German Dependency Parsing More Reliable than Constituent Parsing?

In recent years, research in parsing has extended in several new directions. One of these directions is concerned with parsing languages other than English. Treebanks have become available for many European languages, but also for Arabic, Chinese, or Japanese. However, it was shown that parsing results on these treebanks depend on the types of treebank annotations used. Another direction ...

Journal: CoRR

Volume: abs/1510.07193

Publication date: 2013
